Community structure in directed networks 
We consider the problem of finding communities or modules in directed networks. The most 
common approach to this problem in the previous literature has been simply to ignore edge direction 
and apply methods developed for community discovery in undirected networks, but this approach 
discards potentially useful information contained in the edge directions. Here we show how the 
widely used benefit function known as modularity can be generalized in a principled fashion to 
incorporate the information contained in edge directions. This in turn allows us to find communities 
by maximizing the modularity over possible divisions of a network, which we do using an algorithm 
based on the eigenvectors of the corresponding modularity matrix. This method is shown to give 
demonstrably better results than previous methods on a variety of test networks, both real and 
computer-generated. 



At the most fundamental level a network consists of 
a set of nodes or vertices connected in pairs by lines 
or edges, but many variations and extensions are pos- 
sible, including networks with directed edges, weighted 
edges, labels on nodes or edges, and others. This flexi- 
ble structure lends itself to the modeling of a wide array 
of complex systems, and networks have, as a result,
attracted considerable attention in the recent physics
literature.

Many networks are found to display "community struc- 
ture," dividing naturally into communities or modules 
with dense connections within communities but sparser 
connections between them. Communities have proven 
to be of interest both in their own right as functional 
building blocks within networks and for the insights they 
offer into the dynamics or modes of formation of networks,
and a large volume of research has been devoted to the
development of algorithmic tools for discovering
communities (reviews are available in the literature).
Nearly all of these methods, however, have one thing in
common: they are intended for the analysis of undirected
network data.
Many of the networks that we would like to study are 
directed, including the world wide web, food webs, many 
biological networks, and even some social networks. The 
commonest approach to detecting communities in di- 
rected networks has been simply to ignore the edge di- 
rections and apply algorithms designed for undirected 
networks. This works reasonably well in some cases, al- 
though in others it does not, as we will see in this paper. 
Even in the cases where it works, however, it is clear 
that in discarding the directions of edges we are throw- 
ing away a good deal of information about our network's 
structure, information that, at least in principle, could 
allow us to make a more accurate determination of the 
communities. 

Several previous studies, including our own, have touched
on this problem in the context of other analyses of
directed network data, but they have typically not tackled
the community structure problem directly. In this paper we
propose a method for the discovery of communities in
directed networks that makes explicit use of the
information contained in edge directions.



The method we propose is an extension of the well
established modularity optimization method for undirected
networks, a method that has been shown to be both
computationally efficient and highly effective in
practical applications.

The premise of the modularity optimization method 
is that a good division of a network into communities 
will give high values of the benefit function Q, called the 
modularity, defined by

$$Q = (\text{fraction of edges within communities}) - (\text{expected fraction of such edges}). \tag{1}$$

Large positive values of the modularity indicate that a
statistically surprising fraction of the edges in a
network falls within the chosen communities; that is,
there are more edges within communities than we would
expect on the basis of chance.

The expected fraction of edges is typically evaluated
within the so-called configuration model, a random graph
conditioned on the degree sequence of the original
network, in which the probability of an edge between two
vertices $i$ and $j$ is $k_i k_j/2m$, where $k_i$ is the
degree of vertex $i$ and $m$ is the total number of edges
in the network. The modularity can then be written

$$Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta_{c_i,c_j}, \tag{2}$$

where $A_{ij}$ is an element of the adjacency matrix,
$\delta_{ij}$ is the Kronecker delta symbol, and $c_i$ is
the label of the community to which vertex $i$ is
assigned. One then maximizes $Q$ over possible divisions
of the network into communities, the maximum being taken
as the best estimate of the true communities in the
network. Neither the size nor the number of communities
need be fixed; both can be varied freely in our attempt to
find the maximum.
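
As a concrete illustration of the undirected modularity just defined, the following minimal sketch evaluates it with NumPy for a small hand-built graph. The function name `undirected_modularity` and the toy graph (two triangles joined by one edge) are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def undirected_modularity(A, labels):
    """Q = (1/2m) * sum_ij [A_ij - k_i k_j / 2m] * delta(c_i, c_j)."""
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    k = A.sum(axis=1)                  # vertex degrees
    two_m = k.sum()                    # sum of degrees equals 2m
    same = labels[:, None] == labels[None, :]   # delta(c_i, c_j)
    return ((A - np.outer(k, k) / two_m) * same).sum() / two_m

# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

q_split = undirected_modularity(A, [0, 0, 0, 1, 1, 1])   # one triangle per group
q_whole = undirected_modularity(A, [0, 0, 0, 0, 0, 0])   # everything in one group
print(q_split, q_whole)
```

For this graph the two-triangle division gives Q = 5/14, while placing every vertex in a single community gives Q = 0, as it must for any network.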

In practice, the exhaustive optimization of modularity is
computationally hard, known to be NP-complete over the set
of all graphs of a given size, so practical methods based
on modularity optimization make use of approximate
optimization schemes such as greedy algorithms, simulated
annealing, spectral methods, and others.

Now consider a directed network. In searching for com- 
munities in such a network we again look for divisions of 
the network in which there are more edges within com- 
munities than we expect on the basis of chance, but we 
now take edge direction into account. The crucial point 
to notice is that the expected positions of edges in the 
network depend on their direction. Consider two vertices, 
A and B. Vertex A has high out-degree but low in-degree 
while vertex B has the reverse situation. This means that 
a given edge is more likely to run from A to B than vice 
versa, simply because there are more ways it can fall in 
the first direction than in the second. Hence if we observe 
in our real network that there is an edge from B to A, 
it should be considered a bigger surprise than an edge 
from A to B and hence should make a bigger contribu- 
tion to the modularity, since modularity should be high 
for statistically surprising configurations. 
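
The argument about vertices A and B can be made quantitative with a small numerical sketch. It assumes the directed configuration model introduced in the next paragraph, in which the expected number of edges from a source to a target is the target's in-degree times the source's out-degree divided by m. The degree values and m = 40 are made-up numbers chosen only for illustration.

```python
def expected_edges(k_in_target, k_out_source, m):
    """Expected number of edges source -> target in the directed
    configuration model: k_in(target) * k_out(source) / m."""
    return k_in_target * k_out_source / m

m = 40                    # total number of directed edges (hypothetical)
a_in, a_out = 1, 8        # vertex A: low in-degree, high out-degree
b_in, b_out = 8, 1        # vertex B: the reverse situation

a_to_b = expected_edges(b_in, a_out, m)   # 8 * 8 / 40
b_to_a = expected_edges(a_in, b_out, m)   # 1 * 1 / 40

# An observed edge contributes (1 - expected) to the sum inside Q,
# so the rarer direction contributes more.
surprise_a_to_b = 1 - a_to_b
surprise_b_to_a = 1 - b_to_a
print(a_to_b, b_to_a, surprise_a_to_b, surprise_b_to_a)
```

With these numbers an edge from A to B is expected 64 times more often than one from B to A, so an observed edge from B to A adds far more to the modularity.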

We put these insights to work as follows. Given the joint
in/out-degree sequence of our directed network, we can
create a directed equivalent of the configuration model,
which will have an edge from vertex $j$ to vertex $i$ with
probability $k_i^{\mathrm{in}} k_j^{\mathrm{out}}/m$, where
$k_i^{\mathrm{in}}$ and $k_j^{\mathrm{out}}$ are the in-
and out-degrees of the vertices. (Note that there is no
factor of 2 in the denominator now.) Then the equivalent
of Eq. (2) is

$$Q = \frac{1}{m} \sum_{ij} \left[ A_{ij} - \frac{k_i^{\mathrm{in}} k_j^{\mathrm{out}}}{m} \right] \delta_{c_i,c_j}, \tag{3}$$

where $A_{ij}$ is defined in the conventional manner to be
1 if there is an edge from $j$ to $i$ and zero otherwise.
Note that edges $j \to i$ do indeed make larger
contributions to this expression if $k_i^{\mathrm{in}}$
and/or $k_j^{\mathrm{out}}$ is small.

Now we search for the division of the network into
communities $\{c_i\}$ such that $Q$ is maximized. One can
in principle make use of any of the methods previously
applied to modularity maximization, such as simulated
annealing or greedy algorithms. Here we derive the
appropriate generalization of the spectral optimization
method of Newman, which is both computationally efficient
and appears to give excellent results in practice.

We consider first the simplified problem of dividing a
directed network into just two communities. We define
$s_i$ to be $+1$ if vertex $i$ is assigned to community 1
and $-1$ if it is assigned to community 2. Note that this
implies that $s_i^2 = 1$. Then
$\delta_{c_i,c_j} = \frac{1}{2}(s_i s_j + 1)$ and

$$Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i^{\mathrm{in}} k_j^{\mathrm{out}}}{m} \right] (s_i s_j + 1) = \frac{1}{2m} \sum_{ij} s_i B_{ij} s_j = \frac{1}{2m} \mathbf{s}^T \mathbf{B} \mathbf{s}, \tag{4}$$

where $\mathbf{s}$ is the vector whose elements are the
$s_i$, $\mathbf{B}$ is the so-called modularity matrix
with elements

$$B_{ij} = A_{ij} - \frac{k_i^{\mathrm{in}} k_j^{\mathrm{out}}}{m}, \tag{5}$$

and we have made use of
$\sum_{ij} A_{ij} = \sum_i k_i^{\mathrm{in}} = \sum_j k_j^{\mathrm{out}} = m$.
Our goal is now to find the $\mathbf{s}$ that maximizes
$Q$ for a given $\mathbf{B}$.

In the undirected case the modularity matrix is symmetric
but in the present case it is, in general, not, and the
lack of symmetry will cause technical problems if we
blindly attempt to duplicate the eigenvector-based
machinery presented for undirected networks in the earlier
work of Newman. Luckily, however, we can easily restore
symmetry to our problem by adding $Q$ (a scalar, and hence
equal to its own transpose) to its own transpose to give

$$Q = \frac{1}{4m} \mathbf{s}^T (\mathbf{B} + \mathbf{B}^T) \mathbf{s}. \tag{6}$$

The matrix $\mathbf{B} + \mathbf{B}^T$ is now manifestly
symmetric and it is on this symmetric matrix that we focus
forthwith. Notice that $\mathbf{B} + \mathbf{B}^T$ is not
the same as the modularity matrix for a symmetrized
version of the network in which direction is ignored, and
hence we expect methods based on the true directed
modularity to give different results, in general, from
methods based on the undirected version.

The leading constant $1/4m$ in Eq. (6) is conventional,
but makes no difference to the position of the maximum of
$Q$, so for the sake of clarity we neglect it in defining
our optimization procedure.

Following Newman's treatment of the undirected case, we
now write $\mathbf{s}$ as a linear combination of the
eigenvectors $\mathbf{v}_i$ of $\mathbf{B} + \mathbf{B}^T$
thus: $\mathbf{s} = \sum_i a_i \mathbf{v}_i$ with
$a_i = \mathbf{v}_i^T \cdot \mathbf{s}$. Then

$$Q = \sum_i a_i \mathbf{v}_i^T (\mathbf{B} + \mathbf{B}^T) \sum_j a_j \mathbf{v}_j = \sum_i \beta_i (\mathbf{v}_i^T \cdot \mathbf{s})^2, \tag{7}$$

where $\beta_i$ is the eigenvalue of
$\mathbf{B} + \mathbf{B}^T$ corresponding to eigenvector
$\mathbf{v}_i$. Let us assume the eigenvalues to be
labeled in decreasing order
$\beta_1 \ge \beta_2 \ge \ldots \ge \beta_n$. Under the
normalization constraint $\mathbf{s}^T \cdot \mathbf{s} = n$
the maximum of $Q$ is achieved when $\mathbf{s}$ is
parallel to the leading eigenvector $\mathbf{v}_1$, but
normally this solution is forbidden by the additional
condition that $s_i = \pm 1$. We do the best we can,
however, and make $\mathbf{s}$ as close as possible to
parallel with $\mathbf{v}_1$, meaning we choose the value
of $\mathbf{s}$ that maximizes
$\mathbf{v}_1^T \cdot \mathbf{s}$. It is straightforward
to show that this gives $s_i = +1$ if $v_i^{(1)} > 0$ and
$s_i = -1$ if $v_i^{(1)} < 0$, where $v_i^{(1)}$ is the
$i$th element of $\mathbf{v}_1$. (If $v_i^{(1)} = 0$ then
$s_i = \pm 1$ are equally good solutions to the
maximization problem.)
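
The directed modularity and its matrix form can be checked numerically. The sketch below is a minimal illustration, assuming the convention stated in the text that the adjacency matrix entry is 1 for an edge from j to i; the function names and the toy network (two directed 3-cycles joined by a single edge) are hypothetical.

```python
import numpy as np

def directed_modularity_matrix(A):
    """B_ij = A_ij - k_i^in k_j^out / m, with A_ij = 1 for an edge j -> i."""
    A = np.asarray(A, dtype=float)
    k_in = A.sum(axis=1)     # row sums: edges arriving at vertex i
    k_out = A.sum(axis=0)    # column sums: edges leaving vertex j
    return A - np.outer(k_in, k_out) / A.sum()

def directed_modularity(A, labels):
    """Q = (1/m) * sum_ij B_ij * delta(c_i, c_j)."""
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    return (directed_modularity_matrix(A) * same).sum() / A.sum()

# Hypothetical toy network: cycles 0->1->2->0 and 3->4->5->3, plus edge 2->3.
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
A = np.zeros((6, 6))
for src, dst in edges:
    A[dst, src] = 1.0        # row index = target, column index = source

m = A.sum()
B = directed_modularity_matrix(A)
s = np.array([1, 1, 1, -1, -1, -1])          # one cycle per community
Q = directed_modularity(A, s)                # sum over delta(c_i, c_j)
Q_quadratic = s @ B @ s / (2 * m)            # quadratic form in s and B
Q_symmetric = s @ (B + B.T) @ s / (4 * m)    # symmetrized quadratic form
print(Q, Q_quadratic, Q_symmetric)
```

All three expressions agree (Q = 18/49 for this example) because every row and column of B sums to zero, which is the identity used in the text to drop the constant term.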

Thus we arrive at a simple algorithm for splitting a
network: we calculate the eigenvector corresponding to
the largest positive eigenvalue of the symmetric matrix
$\mathbf{B} + \mathbf{B}^T$ and then assign communities
based on the signs of the elements of the eigenvector.
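
A minimal sketch of this splitting step is given below. It uses `numpy.linalg.eigh`, which returns the eigenvalues of a symmetric matrix in ascending order, so the leading eigenvector is the last column. The function name `spectral_bisection`, the choice to send zero elements to the +1 group, and the toy network are assumptions made for illustration.

```python
import numpy as np

def spectral_bisection(A):
    """Split a directed network in two using the leading eigenvector of
    B + B^T. Convention: A[i, j] = 1 for an edge j -> i."""
    A = np.asarray(A, dtype=float)
    m = A.sum()
    B = A - np.outer(A.sum(axis=1), A.sum(axis=0)) / m
    sym = B + B.T
    eigvals, eigvecs = np.linalg.eigh(sym)     # ascending eigenvalues
    beta1, v1 = eigvals[-1], eigvecs[:, -1]    # leading eigenpair
    s = np.where(v1 >= 0, 1, -1)               # signs define the two groups
    Q = s @ sym @ s / (4 * m)                  # modularity of this division
    return s, Q, beta1

# Hypothetical toy network: cycles 0->1->2->0 and 3->4->5->3, plus edge 2->3.
A = np.zeros((6, 6))
for src, dst in [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]:
    A[dst, src] = 1.0

s, Q, beta1 = spectral_bisection(A)
print(s, Q, beta1)
```

The overall sign of an eigenvector is arbitrary, so which group is labeled +1 may differ between runs or platforms; only the grouping itself is meaningful.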

As in the undirected case, the spectral method typically
provides an excellent guide to the broad outlines of the
optimal partition, but may err in the case of individual
vertices, a situation that can be remedied by adding a
"fine-tuning" stage to the algorithm in which vertices are
moved back and forth between communities in an effort to
increase the modularity, until no further improvements can
be made. We have incorporated such a fine-tuning in all
the calculations presented here.
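
The text does not spell out the fine-tuning procedure, so the following is only a simplified greedy sketch of the idea: repeatedly flip the single vertex whose move gives the largest increase in modularity, and stop when no flip helps. It should not be read as the exact procedure used for the results reported here. The function name `fine_tune` and the deliberately perturbed starting assignment are illustrative assumptions.

```python
import numpy as np

def fine_tune(sym, s, m, tol=1e-12):
    """Greedy single-vertex moves on Q = s^T sym s / (4m), sym = B + B^T."""
    s = np.array(s, dtype=float)
    while True:
        # Flipping s_i changes s^T sym s by -4 s_i (sym s)_i + 4 sym_ii.
        gains = (-4 * s * (sym @ s) + 4 * np.diag(sym)) / (4 * m)
        i = int(np.argmax(gains))
        if gains[i] <= tol:          # no move increases Q: stop
            return s.astype(int)
        s[i] = -s[i]                 # apply the best single move

# Hypothetical toy network: cycles 0->1->2->0 and 3->4->5->3, plus edge 2->3.
A = np.zeros((6, 6))
for src, dst in [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]:
    A[dst, src] = 1.0
m = A.sum()
B = A - np.outer(A.sum(axis=1), A.sum(axis=0)) / m
sym = B + B.T

start = [1, 1, -1, -1, -1, -1]       # vertex 2 placed in the wrong group
tuned = fine_tune(sym, start, m)
print(tuned)
```

Each accepted move strictly increases Q and there are finitely many assignments, so the loop always terminates; in this example it moves vertex 2 back to its own cycle.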

So far we have discussed the division of a network into
two communities. There are a variety of ways of
generalizing the approach to more than two communities but
the simplest, which we adopt here, is repeated bisection.
That is, we first divide the network into two groups using
the algorithm above and then divide those groups, and so
forth. The process stops when we reach a point at which
further division does not increase the total modularity of
the network.

The subdivision of a community contained within a larger
network requires a slight generalization of the method
above. Consider the change in modularity $\Delta Q$ of an
entire network when a community $g$ within it is
subdivided and, defining $s_i$ as before for vertices in
$g$, we find

$$\Delta Q = \frac{1}{m} \left[ \frac{1}{2} \sum_{i,j \in g} B_{ij} (s_i s_j + 1) - \sum_{i,j \in g} B_{ij} \right] = \frac{1}{2m} \sum_{i,j \in g} \left[ B_{ij} - \delta_{ij} \sum_{k \in g} B_{ik} \right] s_i s_j = \frac{1}{4m} \mathbf{s}^T \left( \mathbf{B}^{(g)} + \mathbf{B}^{(g)T} \right) \mathbf{s}, \tag{8}$$

where we have made use of $s_i^2 = 1$ for all $i$ and

$$B_{ij}^{(g)} = B_{ij} - \delta_{ij} \sum_{k \in g} B_{ik}. \tag{9}$$

In other words, $\mathbf{B}^{(g)}$ is the submatrix of
$\mathbf{B}$ for the subgraph $g$ with the sum of each row
subtracted from the corresponding diagonal element.
Although $\mathbf{B}^{(g)}$, like $\mathbf{B}$, is in
general asymmetric, the sum
$\mathbf{B}^{(g)} + \mathbf{B}^{(g)T}$ is symmetric and
hence Eq. (8) has the same functional form as Eq. (6) and
we can apply the same method to maximize $\Delta Q$.
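
The generalized modularity matrix can be sketched in a few lines, and the resulting change in modularity checked against a direct recomputation of Q before and after the subdivision. The helper names and the toy network below are hypothetical; the construction follows the verbal description above (submatrix of B with row sums subtracted from the diagonal).

```python
import numpy as np

def modularity_matrix(A):
    A = np.asarray(A, dtype=float)
    return A - np.outer(A.sum(axis=1), A.sum(axis=0)) / A.sum()

def generalized_modularity_matrix(B, g):
    """Submatrix of B on the vertices in g, with each row sum
    subtracted from the corresponding diagonal element."""
    g = np.asarray(g)
    Bg = B[np.ix_(g, g)].copy()
    Bg -= np.diag(Bg.sum(axis=1))
    return Bg

def delta_q(B, g, s, m):
    """Change in modularity when community g is split according to s."""
    Bg = generalized_modularity_matrix(B, g)
    s = np.asarray(s, dtype=float)
    return s @ (Bg + Bg.T) @ s / (4 * m)

def directed_modularity(B, labels, m):
    labels = np.asarray(labels)
    return (B * (labels[:, None] == labels[None, :])).sum() / m

# Hypothetical toy network: cycles 0->1->2->0 and 3->4->5->3, plus edge 2->3.
A = np.zeros((6, 6))
for src, dst in [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]:
    A[dst, src] = 1.0
m = A.sum()
B = modularity_matrix(A)

# Try splitting community {3, 4, 5} into {3} and {4, 5}.
dq = delta_q(B, [3, 4, 5], [1, -1, -1], m)
dq_direct = (directed_modularity(B, [0, 0, 0, 1, 2, 2], m)
             - directed_modularity(B, [0, 0, 0, 1, 1, 1], m))
print(dq, dq_direct)
```

Here the two values agree and are negative (the change is -8/49), so this particular subdivision would be rejected.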

Our complete algorithm for discovering communities or
groups in a directed network is thus as follows. We
construct the modularity matrix, Eq. (5), for the network
and find the most positive eigenvalue of the symmetric
matrix $\mathbf{B} + \mathbf{B}^T$ and the corresponding
eigenvector. Each vertex is assigned to one of two groups
depending on the sign of the corresponding element of the
eigenvector and then we fine-tune the assignments as
described above to maximize the modularity. We then
further subdivide the communities using the same method,
but with the generalized modularity matrix
$\mathbf{B}^{(g)}$, Eq. (9), fine-tuning after each
division. If the algorithm finds no division giving a
positive value of $\Delta Q$ for a particular community
then we can increase the modularity no further by
subdividing this community and we leave it alone. When all
communities reach this state the algorithm ends.
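
The complete procedure can be sketched as a short repeated-bisection loop. This is a minimal illustration under stated assumptions rather than a reference implementation: the fine-tuning stage is omitted for brevity, the function name `find_communities` and the tolerance `tol` are hypothetical, and the adjacency convention is again that the matrix entry is 1 for an edge from j to i.

```python
import numpy as np

def find_communities(A, tol=1e-10):
    """Repeated spectral bisection of a directed network
    (fine-tuning omitted). Returns an integer label per vertex."""
    A = np.asarray(A, dtype=float)
    n, m = A.shape[0], A.sum()
    B = A - np.outer(A.sum(axis=1), A.sum(axis=0)) / m
    labels = np.zeros(n, dtype=int)
    queue, next_label = [np.arange(n)], 1
    while queue:
        g = queue.pop()
        # Generalized modularity matrix for community g.
        Bg = B[np.ix_(g, g)].copy()
        Bg -= np.diag(Bg.sum(axis=1))
        sym = Bg + Bg.T
        eigvals, eigvecs = np.linalg.eigh(sym)   # ascending order
        if eigvals[-1] <= tol:
            continue                 # no positive eigenvalue: leave g alone
        s = np.where(eigvecs[:, -1] >= 0, 1, -1)
        dq = s @ sym @ s / (4 * m)
        if dq <= tol:
            continue                 # split would not increase modularity
        labels[g[s < 0]] = next_label
        next_label += 1
        queue += [g[s > 0], g[s < 0]]
    return labels

# Hypothetical toy network: cycles 0->1->2->0 and 3->4->5->3, plus edge 2->3.
A = np.zeros((6, 6))
for src, dst in [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]:
    A[dst, src] = 1.0

labels = find_communities(A)
print(labels)
```

On this example the loop separates the two cycles and then declines to subdivide either of them, since every further split lowers the modularity.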

We now give a number of examples of the application 
of our method. We consider four different directed net- 
works of varying degrees of complexity, starting with a 
relatively simple but important example: the world wide 
web. 

Weblogs or "blogs" are personal web sites on which their
proprietors record brief thoughts on topics of their
choosing, often with links to other blogs with related
discussions. In a recent study, Adamic and Glance looked
at a network of 1225 blogs focusing on US politics. In
this network the vertices represent the blogs and there is
a directed link between vertices if one blog links to
another. Adamic and Glance also characterized the
political persuasion of each blog as conservative or
liberal based on textual content.

FIG. 1: Community assignments for the two-community random
network described in the text from (a) a standard
undirected modularity maximization which ignores edge
direction and (b) the algorithm of this paper. The shaded
regions represent the communities discovered by the
algorithms. The true community assignments are denoted by
vertex shape.

When fed into our community finding algorithm, the 
blog network divides into two clear communities, with 
one being composed almost entirely of conservative blogs 
and the other of liberal blogs. (The algorithm places 
97% of the blogs characterized by Adamic and Glance 
as conservative in the first community and 93% of those 
characterized as liberal in the second.) The algorithm 
finds no subdivision of either community that gives any 
increase in the modularity, indicating that the network 
consists of only two tightly knit communities
corresponding closely to the traditional left-right
division of US politics. This serves as a particularly
clear demonstration of the algorithm's ability to find
meaningful structure in network data. On the other hand,
this particular network gives very similar results when
analyzed using the undirected form of the spectral
modularity algorithm, in which edge direction is entirely
ignored. The principal interest in our algorithm derives
from its ability to find structure in networks where the
simpler undirected version fails, so let us turn to
examples of this kind.

For illustrative purposes, we first consider an artifi- 
cial computer-generated network, designed specifically to 
test the performance of the algorithm. In this network of 
32 vertices, vertex pairs are connected by edges indepen- 
dently and uniformly at random with some probability p. 
The edges are initially undirected. The network is then 
divided into two groups of 16 vertices each and edges that 
fall within groups are assigned directions at random but 
edges between groups are biased so that they are more 
likely to point from group 1 to group 2 than vice versa. 
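
A generator for a test network of this kind might look as follows. The text does not give the edge probability or the strength of the directional bias, so the defaults `p=0.3` and `bias=0.9`, like the function name `directed_benchmark` and the fixed seed, are assumptions chosen purely for illustration.

```python
import numpy as np

def directed_benchmark(n=32, p=0.3, bias=0.9, seed=0):
    """Random network with two equal groups whose only distinguishing
    feature is the direction of the between-group edges.
    Convention: A[i, j] = 1 for an edge j -> i."""
    rng = np.random.default_rng(seed)
    group = np.repeat([0, 1], n // 2)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() >= p:
                continue                      # no edge between i and j
            if group[i] == group[j]:
                forward = rng.random() < 0.5  # direction chosen at random
            else:
                # i < j, so i is in the first group and j in the second.
                forward = rng.random() < bias
            src, dst = (i, j) if forward else (j, i)
            A[dst, src] = 1.0
    return A, group

A, group = directed_benchmark()
first_to_second = A[16:, :16].sum()   # edges from first group to second
second_to_first = A[:16, 16:].sum()   # edges in the opposite direction
print(first_to_second, second_to_first)
```

Because edge placement is uniform, the undirected version of this network carries no group signal; only the imbalance between the two printed counts distinguishes the groups.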



FIG. 2: Community assignments for the three-community 
random network described in the text as generated by 
(a) standard undirected modularity maximization and (b) the 
algorithm of this paper. 



FIG. 3: The network of technical terms described in the text 
along with the community assignments determined by a stan- 
dard undirected modularity maximization (boxes) and the al- 
gorithm of this paper (shaded groups). 



By construction, there is no community structure to be
found in this network if we ignore edge directions (the
positions of the edges are entirely random), and this is
confirmed in Fig. 1a, which shows the results of the
application of the undirected modularity maximization
algorithm. If we take the directions into account,
however, using the algorithm presented in this paper, the
two communities are detected almost perfectly: just one
vertex out of 32 is misclassified (see Fig. 1b).

Even in networks where there is clear community struc- 
ture contained in the positions of the edges it is still pos- 
sible for the directions to contain additional useful infor- 
mation. As an example of this type of behavior, consider 
the network shown in Fig. 2, which has 32 vertices and 
three communities. For two of the communities, con- 
taining 14 vertices each, there is a high probability of 
connection between pairs of vertices that fall in the same 
community but a lower probability if one of the vertices 
is in a different community. Structure of this kind, in 
which edge direction does not play a role, can in princi- 
ple be found by algorithms designed for undirected net- 
works. The third community, however, is different. It 
has four vertices, each of which has a high probability 
of connection to every other vertex. The only feature 
that distinguishes this third community as separate is 
the direction of its edges — two of the four vertices have 
high probability of ingoing edges, the other two have high 
probability of outgoing edges, and there are also a small 
number of additional edges running from the former to 
the latter. These last edges are statistically surprising 
in the sense considered here and hence tend to bind the 
third community together. 

Applied to this network, the standard undirected
community detection algorithm finds the two large
communities with ease, but the remaining community is not
found and its vertices are dispersed by the algorithm
among the other communities (Fig. 2a). Our directed
algorithm, on the other hand, finds all three communities
without difficulty (Fig. 2b). Again the algorithm has made
use of information contained in the edge directions to
identify community structures not accessible to previous
methods.

Returning now to real-world networks, we show in Fig. 3 a
further example of the performance of our algorithm, in
this case on a word network. The network represents
connections between a set of technical terms, such as
"vertex" and "edge," contained in a glossary of network
jargon derived from recent review papers by Newman and
Boccaletti et al. Vertices in this network represent terms
and there is a directed edge from one vertex to another if
the first glossary term was used in the definition of the
second. Because circular definitions are unhelpful and
normally avoided, most edges in the network are not
reciprocated.

Figure [3] shows the communities found in this network 
by our directed modularity algorithm. The algorithm 
finds six communities in this case that appear to cor- 
respond to groupings of terms clustered around a few 
basic concepts. For instance, one group deals with words 
describing basic network structure, such as "edge" and 
"graph," while another deals with terms describing di- 
rected networks. A third group contains the terms "ver- 
tex" and "degree" and related concepts and the remain- 
ing groups are associated with clustering, communities, 
and paths respectively. Thus, the algorithm again ap- 
pears to find meaningful structure in the network, of the 
type that could be useful in understanding the broader 
shape of otherwise poorly understood systems. 

We have also applied the undirected modularity max- 
imization algorithm to this same network, which results 
in four groups. Two of these are closely similar to ones 
found by the directed algorithm — the groups dealing with 
edges and with directed networks. The other groups, 
however, contain a mix of terms that do not correspond 
closely to any obvious network concepts, with words like
"vertex," "diameter," "cycle," and "motif" grouped
together. As discussed above, the undirected algorithm
has less information at its disposal, the directions of the
edges having been discarded, so it is natural that it is un- 
able to detect some of the structure found by its directed 
counterpart. 

In summary, we have presented a method for detecting
community structure in directed networks that makes
explicit use of information contained in edge directions,
information that most other algorithms discard. Our method
is an extension of the established modularity maximization
method widely used to determine community structure in
undirected networks. We have applied the method to a
variety of networks, both real and simulated, showing that
it is able to recover known community structure in
previously studied networks and extract additional and
revealing structural information not available to
algorithms that ignore edge direction. The computational
efficiency of the algorithm is essentially identical to
that of the corresponding algorithm for undirected
networks and hence we see no reason to continue to use the
undirected algorithm on directed graphs; we recommend the
use of the full directed algorithm in all cases where
researchers wish to analyze both edge placement and edge
direction.